XML for the absolute beginner
A guided tour from HTML to processing XML with Java

Become a tree surgeon!
One final, somewhat more advanced topic before we close. The SAX interface
allows you to parse an XML file and execute particular actions whenever
certain structures (like tags) appear in the input. That's great for a lot
of applications. There are times, though, when you want to be able to cut
and paste whole sections of XML documents, restructure them, or maybe even
build from scratch an object structure like the one in Figure 3, and then
save the whole structure as an XML file. For that, you need access to the
DOM API.
The DOM API allows you to represent your XML document as a tree of
nodes in your Java (or other language) program. While a SAX parser reads
an XML file, doing callbacks to a user-defined class, a DOM
parser reads an XML file and returns a representation of the file as
a tree of objects, most of which are of type org.w3c.dom.Node.
This gives you immense power in manipulating structured documents. Figure
4 is an example of what I'm talking about.
Figure 4. A DOM document transformation system
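To make the tree idea concrete, here is a minimal sketch that parses a small XML string and prints an indented outline of the resulting nodes. It assumes the JAXP classes (javax.xml.parsers) bundled with current JDKs rather than any particular parser package named in this article. The class name DomOutline and the sample XML are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
final class DomOutline {

    /** Parses XML text into a DOM tree; parser errors are rethrown unchecked. */
    static Document parse(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    /** Appends one indented line per element or non-blank text node. */
    static void outline(Node node, int depth, StringBuilder out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            indent(out, depth);
            out.append(node.getNodeName()).append('\n');
        } else if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                indent(out, depth);
                out.append('"').append(text).append('"').append('\n');
            }
        }
        // Every node, whatever its type, exposes its children the same way.
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            outline(child, depth + 1, out);
        }
    }

    private static void indent(StringBuilder out, int depth) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<recipe><title>Soup</title><ingredient>eggs</ingredient></recipe>");
        StringBuilder out = new StringBuilder();
        outline(doc.getDocumentElement(), 0, out);
        System.out.print(out);
    }
}
```

The recursive walk works because elements and text alike are handled through the same Node interface, which is the point of the tree representation.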
The Document Object Model, in the package org.w3c.dom,
defines interfaces for document elements (that is, tags), DTD elements,
text nodes (where the actual text inside the tags is kept), and quite a
few other things we haven't even discussed. Figure 4 is a schematic of a
general system that can transform one XML document to some other form
programmatically. Your program uses a DOM parser to parse an XML file, and
the parser returns a tree that is an exact representation of the XML in
the file. Note that, at this point, you've read an input file, checked it
for well-formedness and validity, and built a complex hierarchical
object structure, all in just a few lines of code. You can then traverse
the document tree in software, doing whatever you like to the tree
structure. Add nodes, delete them, update their values, read or set their
attributes -- basically anything you like. When your tree has the new
structure you desire, tell the top node to print itself to another XML
file, and the new document is created.
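The parse, edit, and write-out cycle described above might look like the following sketch. It makes several assumptions: it uses JAXP (javax.xml.parsers and javax.xml.transform) from a current JDK, the default builder checks only well-formedness unless validation is switched on, and the element names, the servings attribute, and the DomEdit class are invented for the example. The Node interface itself has no print method, so the sketch swaps in an identity Transformer to turn the tree back into text.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
final class DomEdit {

    /** Parses XML text into a DOM tree; parser errors are rethrown unchecked. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    /** Writes the tree back out as XML text using an identity transform. */
    static String serialize(Document doc) {
        try {
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter writer = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(writer));
            return writer.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not serialize document", e);
        }
    }

    /** Adds a node, deletes a node, sets an attribute, then serializes. */
    static String restructure(String xml) {
        Document doc = parse(xml);
        Element root = doc.getDocumentElement();

        // Add a new <step> element with a text child.
        Element step = doc.createElement("step");
        step.appendChild(doc.createTextNode("Stir"));
        root.appendChild(step);

        // Delete the first <note> element, if there is one.
        Node note = root.getElementsByTagName("note").item(0);
        if (note != null) {
            note.getParentNode().removeChild(note);
        }

        // Set an attribute on the root element.
        root.setAttribute("servings", "4");

        return serialize(doc);
    }

    public static void main(String[] args) {
        System.out.println(restructure(
            "<recipe><note>draft</note><title>Soup</title></recipe>"));
    }
}
```

To write to a file instead of a string, the StreamResult can wrap a java.io.File rather than a StringWriter.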
XML-Java synergy
One of the reasons Java and XML
are so well-suited for one another is that Java and XML are both
extensible: Java through its class loaders, XML through its DTD. Imagine a
server, reading and writing XML, where the DTD for the system input can
change. When a new element is added to the input language, a running
server (written in Java) could automatically load new Java classes to
handle the new tags. You would not only have an extensible application
server -- you wouldn't even have to take the server down to add the
extensions!
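As a rough illustration of that idea, the sketch below looks up a handler class by tag name the first time a tag is seen and loads it through a class loader. Everything here is hypothetical: the TagHandler interface, the TitleHandler class, the HandlerRegistry class, and the convention that a tag named title maps to a class named TitleHandler. A real server would also need a class loader that can see newly deployed classes (for example, a java.net.URLClassLoader over a plug-in directory), which this sketch leaves to the caller.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical contract implemented by every tag-handling class. */
interface TagHandler {
    String handle(String text);
}

/** Example handler; in a real server it could be deployed after startup. */
final class TitleHandler implements TagHandler {
    public String handle(String text) {
        return "TITLE: " + text;
    }
}

/** Loads and caches handler classes on demand, keyed by tag name. */
final class HandlerRegistry {
    private final Map<String, TagHandler> cache = new ConcurrentHashMap<>();
    private final ClassLoader loader;

    HandlerRegistry(ClassLoader loader) {
        this.loader = loader;
    }

    /** Assumed naming convention: tag "title" maps to class "TitleHandler". */
    static String classNameFor(String tag) {
        // Reuse TagHandler's package (if any) so the lookup is fully qualified.
        String self = TagHandler.class.getName();
        String prefix = self.substring(
            0, self.length() - TagHandler.class.getSimpleName().length());
        return prefix + Character.toUpperCase(tag.charAt(0))
            + tag.substring(1) + "Handler";
    }

    /** Returns the handler for a tag, loading its class on first use. */
    TagHandler handlerFor(String tag) {
        return cache.computeIfAbsent(tag, t -> {
            try {
                Class<?> type = Class.forName(classNameFor(t), true, loader);
                return type.asSubclass(TagHandler.class)
                    .getDeclaredConstructor()
                    .newInstance();
            } catch (ReflectiveOperationException e) {
                throw new IllegalArgumentException("no handler for <" + t + ">", e);
            }
        });
    }

    public static void main(String[] args) {
        HandlerRegistry registry =
            new HandlerRegistry(HandlerRegistry.class.getClassLoader());
        System.out.println(registry.handlerFor("title").handle("Soup"));
    }
}
```

Because a handler is resolved only when its tag first appears, supporting a new element comes down to making a new class visible to the loader, with no change to the dispatch code.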
This one small idea hints at what XML and Java can do together. The next
section is about a company that has made the combination of XML and Java
its core technology.
XML with Java in the real world
You now have a handle on XML technology, including how
it's implemented in Java. You understand that a document can be viewed as
a tree of objects and manipulated using SAX or DOM. Let's have a look at a
real company that is using all of these technologies to provide solutions
for its clients.
DOM interfaces exist not only for XML, but for HTML, as well. This
means that the leftmost document in Figure 4 could be a Web page from
which you wish to extract information for manipulation in Java.
In fact, Epicentric, an Internet startup in San Francisco, does just
that. Epicentric uses Java and XML in its turnkey systems to allow
creation of custom portal sites. Portal sites, like the front
pages of Netscape Netcenter and Excite!, are integrated aggregations of
information from various Internet sources. In a corporate Internet
environment, a portal may contain information gleaned from external Web
pages (for example, weather reports), alongside internal enterprise data.
Portals are also often customizable by each user.
Epicentric's systems read HTML from the Internet as DOM documents,
extract information from those documents, and store that information in a
standard XML format. Other information sources are also converted into
this same XML format and stored on Epicentric's server. The company then
uses the XML with XSL and Java Server Pages to create custom portals for
its clients.
"A lot of good work has been done on the basics ... like parsers and
XSL processors," says Ed Anuff, CEO of Epicentric. One benefit of using
XML is that it makes designers think through the system structure in a
very structured way, Anuff says.
When asked about concerns with XML, Anuff states that many of the
problems he runs into are architectural, such as which DTD to use, and
designating the appropriate places in the system to use XML. Systems
designers are still working out how to use this new technology most
effectively in an enterprise environment.
Also, since the technology is so new, it's often hard to know what
pieces of the system to build in-house. For example, quite a few companies
built their own XML parsers but now have little return on investment
because larger companies are developing superior XML technology and giving
it away for free. "The biggest challenge today is figuring out when you're
reinventing the wheel, and when you're adding value," says Anuff.
Despite these challenges, the future looks bright for Epicentric, which
has several "pretty decent-sized customers" using the company's software
in beta. With clients and advertisers that include the likes of Eastman
Kodak Company, Sun Microsystems, Chase Bank, and LIFE Magazine, Epicentric
is using XML to aggregate and redistribute information in novel ways.
Conclusion
XML is a powerful
data representation technology for which Java is uniquely well-suited.
You're going to be hearing a lot about XML in the coming months and years.
Anyone working with information systems that communicate with other
systems (and what systems don't, these days?) has a lot to gain by
understanding XML technology and using it to its full advantage.
Using XML with XSL or CSS, you can manage your Web site's content and
style, and change style in one place (the style sheet) instead of editing
piles of HTML files or, worse, editing the scripts that produce HTML
dynamically. Using SAX or DOM, you can treat Web documents as object
structures and process them in a general and clean way. Or, you can leave
browsers behind entirely and write pure-Java clients and servers that talk
to each other -- and other systems -- in XML, the new lingua
franca of the Internet. Sun Microsystems, the creator of Java, has
perhaps best described the power of XML and Java together in its slogan:
Portable Code -- Portable Data. Start experimenting with XML in Java, and
you'll soon wonder how you ever lived without it.
Thanks to Dave Orchard for his comments on drafts of this article,
and to the many helpful people I met in San Jose, CA.
About the author
Mark Johnson lives in Fort Collins, CO, and is a C++
programmer by day and Java columnist by night. Very late night.
Resources
There are so many XML resources on the Web, I've had to categorize. The first section
here is the most useful, since the documents are either high-level
summaries or excellent link sites. Apologies to anyone who was omitted.
XML and Java: General XML resources
- "XML, Java and the Future of the Web," Jon Bosak. The paper that
started it all, at least from a Java programmer's point of view.
Definitely worth a read, even if it's a bit dated. Jon is commonly
considered to be the father of XML. Funny how all of these technologies
seem to have paternity:
http://metalab.unc.edu/pub/sun-info/standards/xml/why/xmlapps.html
- "Media-Independent Publishing: Four Myths about XML," Jon Bosak:
http://metalab.unc.edu/pub/sun-info/standards/xml/why/4myths.htm
- Robin Cover's XML-SGML site is, according to my SGML buddies, the
bible of XML resources:
http://www.oasis-open.org/cover/
- The W3C's XML resource page lets you cheer from the sidelines as XML
technology proposals develop into recommendations, or join in the fray
on their active mailing lists:
http://www.w3.org/XML/
- OASIS, the Web site of the Organization for the Advancement of
Structured Information Standards, offers general news and information
about XML:
http://www.oasis-open.org/
- The Graphics Communications Association, host of the XTech '99
conference (March 11 to 13, 1999, San Jose, CA) and the upcoming XML
Europe '99 conference in Granada, Spain, (April 26 to 30, 1999) has a
Web site packed with XML information:
http://www.gca.org/
- XML.com is great for watching trends and digging up XML news:
http://www.xml.com/
- Textuality hosts Tim Bray's site. Check it out for a look at the
"big picture" of how XML fits into the structured document universe --
and for a look at Lark, Tim's nonvalidating XML processor:
http://www.textuality.com/
- The XML FAQ:
http://www.ucc.ie/xml/
- IBM's XML Website is an outstanding supplement to alphaWorks:
http://www.software.ibm.com/xml/index.html
XML and Java
- "XML and Java: The Perfect Pair" by Ken Sall (Internet.com, November
1998) provides information about XML, Java, and why these two are a
match made in heaven:
http://wdvl.com/Authoring/Languages/XML/Java/index.html
Tutorials and training
- Generally Markup, Richard Lander's Web site, may be of interest to
you if you haven't yet read enough about markup languages:
http://pdbeam.uwaterloo.ca/~rlander/
- The Mulberry Technologies Web site is a good resource for commercial
training in XML, as well as general XML and SGML consulting by seasoned
SGML experts:
http://www.mulberrytech.com/
- The Web Developer's Virtual Library Series on XML offers good
summaries of various XML technologies, as well as annotated indices of
XML software:
http://wdvl.com/Software/XML
- Microsoft's Site Builder Network provides a series of articles
called "Extreme XML," one of which appears in the following link. While
some of it focuses on Microsoft-only, Windows-only technology, there's
still some great stuff here:
http://www.microsoft.com/sitebuilder/magazine/xml.asp
- Webmonkey has a good series of articles introducing readers to XML.
The index is at:
http://www.hotwired.com/webmonkey/xml/?tw=xml
- "What the ?xml!" by L.C. Rees offers an interesting take on XML and
why it's necessary -- nicely written and entertaining to boot:
http://www.geocities.com/SiliconValley/Peaks/5957/wxml.html
- "The XML Revolution" by Dan Connolly is a quick backgrounder on XML
(Nature):
http://helix.nature.com/webmatters/xml.html
Cascading Style Sheets
- W3C's CSS page will get you started learning about CSS:
http://www.w3.org/Style/CSS/
- "Cascading Style Sheets: Designing for the Web" by Håkon Wium Lie and
Bert Bos (Addison-Wesley, 1997). Sample chapters from the book appear at:
http://www.awl.com/cseng/titles/0-201-41998-X/liebos/
Extensible Style Language (XSL)
- The W3C's XSL page:
http://www.w3.org/Style/XSL/
- Read (and comment on) the W3C's XSL Working Draft (currently dated
December 16, 1998):
http://www.w3.org/TR/WD-xsl
- "The Extensible Style Language: Styling XML Documents"
(WebTechniques Magazine) XSL tutorial information and examples:
http://www.webtechniques.com/features/1999/01/walsh/walsh.shtml
- Microsoft's XML and XSL tutorial site is especially interesting
because of the recent release of client-side XSL in Internet Explorer
5.0. Extensive and excellent:
http://www.microsoft.com/xml
- If you're still using IE 4.0, you can still experiment with XML,
using Microsoft's internal DOM:
http://www.microsoft.com/xml/articles/xmlmodel.asp
- If you want to experiment with XSL, try downloading IBM's LotusXSL.
It's all Java, and for the time being, it's free:
http://www.alphaworks.ibm.com/tech/LotusXSL
- Or, you can try James Clark's XT XSL engine, downloadable from:
http://www.jclark.com/xml/xt.html
Upcoming XSL contest
Though the details aren't yet worked out, Sun Microsystems will soon
announce a call for proposals for a $30,000 grant to develop a
client-side processor for full XSL implementation in Mozilla.
It will also announce, in conjunction with Adobe, a contest (first prize
$40,000, second prize $20,000) to develop a pure-Java, server-side
processor of the entire XSL language, to format XML to PDF (Adobe's
document format). Keep watching the Java Developer Connection (requires
free registration), and Mozilla sites for the eventual announcements.
- "XTech '99: Java and the XML wave" by Mark Johnson
(JavaWorld, April 1999) offers the most current information on
the contest:
http://www.javaworld.com/javaworld/jw-04-1999/jw-04-xtech.html
Simple API for XML (SAX)
- The definitive description of SAX is available online. You can also
download free SAX software here:
http://www.megginson.com/SAX/index.html
Document Object Model (DOM)
- The W3C information page for the Document Object Model appears on
the W3C site:
http://www.w3c.org/DOM/
- Among other things, you'll find the W3C Recommendation for DOM Level
1:
http://www.w3.org/TR/REC-DOM-Level-1/
- The Java bindings for DOM, for both XML and HTML, are in this
Recommendation appendix:
http://www.w3.org/TR/REC-DOM-Level-1/java-language-binding.html
- A great DOM tutorial by William Robert Stanek appears on PC
Magazine Online in "Object-Based Web Design." This tutorial
includes a discussion of using DOM with IDL, CORBA's Interface
Definition Language:
http://www8.zdnet.com/pcmag/pctech/content/17/13/tf1713.001.html
Dynamic HTML
- The Dynamic HTML Resource page contains several links to DHTML
articles:
http://www.hotwired.com/webmonkey/dynamic_html/?tw=dynamic_html
Software
- Epicentric, Inc.:
http://www.epicentric.com/
- More XML (and other Java) technology than you can shake a stick at
is available at IBM's alphaWorks:
http://alphaworks.ibm.com/
- Version 2 of IBM's excellent XML parser package, xml4j, is available
for download. This package includes several parsers, both validating and
nonvalidating:
http://www.alphaworks.ibm.com/tech/xml4j
- See also IBM's exciting Bean Markup Language project, which uses XML
to represent and manipulate JavaBeans:
http://www.alphaworks.ibm.com/tech/bml
- Another free Java XML parser was written by the indefatigable James
Clark. Download it at:
http://www.jclark.com/xml/xp/index.html
- XEENA is IBM alphaWorks's DTD-guided XML editor. You want it, you
need it, you gotta have it:
http://www.alphaworks.ibm.com/tech/xeena
- Mozilla.org is the open source community's effort to extend the
Netscape source code. Find out about it at:
http://www.mozilla.org/
- Information about XML and CSS in Mozilla appears at:
http://www.mozilla.org/rdf/doc/xml.html
- You can read about Sun's XML and Java initiatives at:
http://www.sun.com/990310/java_xml.jhtml
- In addition, Java Project X includes source code downloadable from:
http://developer.java.sun.com/developer/earlyAccess/xml/index.html
- ArborText has a suite of sophisticated tools for editing SGML, XML,
and XSL:
http://www.arbortext.com/Products/products.html
- Oracle8i from Oracle Corporation uses XML inside the Oracle core:
http://www.oracle.com/xml/
- Download Oracle's free XML for Java parser:
http://technet.oracle.com/direct/3xml.htm
- Microsoft's Internet Explorer 5.0, released this month, implements
part of the XSL spec. You can find it on Microsoft's Web site -- and
also just about anywhere else:
http://www.microsoft.com/windows/ie/default.htm
- You can also download a beta release of Microsoft's XML Notepad
editor (limited to running only on Microsoft Windows):
http://www.microsoft.com/xml/notepad/download.asp
- Vervet Logic of Bloomington, IN, has announced XML <PRO>, a
commercial XML editor:
http://www.vervet.com/
- Majix, to transform XML to HTML via XSL, is available at:
http://www.tetrasix.com/
- If your French is rusty, you might want to try the English-language
site at:
http://www.tetrasix.com/english/default.htm
History
- Read about the history of HTML here. It's part of an online book, so
there's no telling for how long it will be available:
http://ei.cs.vt.edu/~wwwbtb/hardcopy/book/chap4/origins.html
The two chapters listed below (of the book "HTML Unleashed" by Rick Darnell,
et al.) also cover some of the technical background of these languages.
- SGML history:
http://www.webreference.com/dlab/books/html/3-2.html
- XML history (such as it is):
http://www.webreference.com/dlab/books/html/38-0.html
- Nothing to do on Friday night? Why not read up on the history of
SGML? Charles Goldfarb, considered by many to be the "father of SGML,"
reminisces publicly at:
http://www.sgmlsource.com/Goldfarb/history/index.htm
- Useful XML and SGML information appears at Goldfarb's Web site,
including a comprehensive XML book list:
http://www.sgmlsource.com/
Miscellaneous links
- Uche Ogbuji has written an interesting article in
LinuxWorld about using XML on Linux in the Enterprise. It's at:
http://www.linuxworld.com/linuxworld/lw-1999-03/lw-03-xml.html
- Bluestone Software has recently made a splash with pure-Java XML
application servers, and a freely downloadable Swing package called
XwingML:
http://www.bluestone.com/
- Everyone (except Microsoft) is pretty freaked out about the US
Patent Office awarding Microsoft a patent for certain kinds of
functionality in style sheets. What happens with this patent, and its
impact on developing technology, remains to be seen. Judge for yourself
by reading the patent at:
http://www.patents.ibm.com/patlist?icnt=US&patent_number=5860073
- The title of the sample recipe is actually the title of a very funny
song by William Bolcom. Similar recipes may be found at:
http://www.b4uby.com/granny/gsoup.htm
- The song appears on a compact disc (with other odd songs) available
from the Public Radio Music Source at:
http://75music.org/best/docs/keepers.htm